Concept network

ABSTRACT

A concept network that can be generated in response to a user query. Various embodiments include analysis of structure information, for example, where such information is based at least in part on Universal Resource Locators (URLs) of Web sites or data storage locations. A concept network may be used with a search tool where the search tool searches a plurality of sites (e.g., Web sites, data storage locations, etc.). In such an example, each site location is arranged with a node. Certain ones of the nodes are connected by at least one link. The concept network selects a portion of certain ones of the nodes based on the link, wherein the at least one link is used for content purposes.

RELATED APPLICATIONS

This patent application is a divisional application of U.S. patent application Ser. No. 10/427,550, filed on May 1, 2003, entitled “Concept Network” (now U.S. Pat. No. 7,406,459, issued Jul. 29, 2008). The foregoing U.S. patent application Ser. No. 10/427,550 and U.S. Pat. No. 7,406,459 are incorporated by reference herein.

TECHNICAL FIELD

This disclosure relates to search tools, and more particularly to displayed search results.

BACKGROUND

With the rapid growth of such networks as the Internet, the accuracy and quality of searches become more and more important. However, many users find that searching using search engines yields a large number (perhaps thousands) of results, many of which are not closely applicable to their submitted query. As such, many users become dissatisfied with the search results. Some users also find that the large number of returned results for queries obscures important information contained in the Internet.

Most prior-art search engines are primarily based on a keyword comparison. Consider a query asking for the top N digital camera manufacturers in the world, where N is an integer. Keyword comparison search engines would return some Web pages that contain the key term “digital camera” and other Web pages that contain the key term “manufacturers”. Therefore, the percentage of the total results returned by keyword comparison search engines that relate to digital camera manufacturers is relatively small. The keyword comparison search engine also has no way to compare whether a particular digital camera manufacturer is larger or better known (or some other quantifiable comparison) than another digital camera manufacturer based on their Web pages. As such, prior-art search engines, being primarily based on keyword comparisons, often return a large number of results, many of which are only marginally related to the query. Such keyword comparison search engines cannot identify the most applicable ones of a plurality of searched Web sites based on the structure of the Web sites.

In another aspect, many users believe that they have to search through a large number of queries to obtain useful search results. As such, the users believe that the queries (and the examination of the search results for relevancy) demand a considerable amount of time to ensure that all relevant responses are considered. Even after such time is spent, the users often believe that the most significant search results may be lost within a vast amount of irrelevant information.

In yet another aspect, many Internet applications utilize such lexicography tools as WordNet® (developed at Princeton University under the direction of Prof. George A. Miller) to expand the user's query to improve the precision of the search engine. WordNet is an online lexical reference system. With WordNet, nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relationships link the synonym sets. With WordNet, users manually input their personal taxonomy relative to Web pages. Therefore, WordNet is not suitably configured to keep up with the rapid growth and dynamic changes of the Internet and other networked computer systems. For example, over half of the words in the Web do not appear in WordNet.

SUMMARY OF THE INVENTION

This invention relates to a concept network. The concept network can be generated in response to a user query. In one embodiment, the concept network is used with a search tool. The search tool searches a plurality of data storage locations. Each data storage location is arranged with a node. Certain ones of the nodes are connected by at least one link. The concept network selects a portion of certain ones of the nodes based on the link, wherein the at least one link is used for content purposes.

BRIEF DESCRIPTION OF THE DRAWINGS

Throughout the drawings, the same numbers reference like features and components.

FIG. 1 is a block diagram of one embodiment of a computer environment including a concept network;

FIG. 2 is a block diagram of another embodiment of a Web site search portion that develops one embodiment of the concept network;

FIG. 3 illustrates a block diagram of one embodiment of taxonomy construction for a domain that is used to establish a concept network;

FIG. 4 illustrates a perspective view of displayed results of a concept network as it may appear on a display of a computer environment;

FIG. 5 illustrates a flow chart of one embodiment of a Web Site Structure Analysis algorithm; and

FIG. 6 illustrates a block diagram of one embodiment of a computer environment that may be used to derive or display a concept network to a user.

DETAILED DESCRIPTION

This disclosure provides a variety of embodiments of concept networks. In concept networks, the query is equated to a concept that is being searched. In concept networks, a plurality of search result information is structurally organized into a plurality of concepts that are displayed to the user. Concept networks, as disclosed herein, retrieve and/or display search results (the search results are arranged based on the concepts) according to the relevancy of the search results to the various concepts of the query. Concept networks can be configured to allow users to access a variety of search results, a variety of contents of the search results, or a variety of portions of search results based on the relevancy of the search results to the user query. Such concept networks are generated in computer environments based on the query. One aspect of the term “concept network” relates to the grouping of concepts into the concept network in a fashion such that the concepts can be understood and accessed by the user.

One embodiment of concept networks is particularly directed to the Internet, even though concept networks in general can be applied to any computer environment or computer system. In the Internet embodiment of a concept network, a user may input a query, and the displayed output concept network might be a list of concepts that can be selected by the user. For example, if a user queries “electronic systems”, the displayed concept network may include a variety of concepts such as cellular telephones, computers, audio systems, video systems, etc. The user could thereupon select one of these concepts displayed as the concept network to display more specific search results.

One embodiment of the concept network includes a large connection graph displaying multiple interconnected concepts such as described relative to FIG. 3. The concept network, as with search results from prior-art search engines, is derived based on user queries. Concept networks increase the accuracy of the response to a user query compared to prior-art search engines. In addition, concept networks limit the large number of extraneous search results that are prevalent among prior-art search engines, the latter of which rely on keyword queries.

FIG. 1 illustrates a block diagram of an embodiment of a computer environment 50 that is configured to generate and display the concept network 100. The computer environment 50 can include an optional network portion 72 (although the computer can also be a stand-alone computer). The computer environment 50 includes a search tool 74 and a display tool 75. Portions of the search tool 74 and the display tool 75 include portions of one embodiment of the concept network 100. The concept network 100 is provided as a user interface by which a structured representation of the search results is displayed to the user, and the search results are structured, or arranged, according to concepts that can also be illustrated to the user to indicate the relevancy of each returned search result. Whereas prior-art search engines typically return a listing of applicable Web pages or the like, the returned concept network as disclosed in this disclosure includes, e.g., a plurality of Web pages structurally arranged according to their content. The search to generate the concept network 100 relies on the information contained within the searched data (e.g., a Web page), as indicated by some structural feature of the searched data. As such, the returned results of the concept network (which is based on the structure of the searched data) are generally more accurate than those of prior-art search engines (which are based on keyword matching).

The FIG. 1 computer environment 50 that includes the concept network 100 illustrates a generalized computer environment. It is envisioned that the concept network is highly applicable in any type of computer environment that could use search engines including stand-alone computers, networked computers, or mainframe computers. However, this disclosure is described as being applicable to particular embodiments of computer environments. More particularly, it is envisioned that the concept network 100 is applicable to networks. Even more particularly, it is envisioned that one embodiment of the computer environment 50 that includes the concept network 100 includes a variety of Web pages arranged on Web sites across the Internet. As such, certain embodiments of the concept network rely on servers that provide data forming the search results to the clients in networked computer environments such as the Internet. One embodiment of the structured representation of the search results that are displayed to the user is based on the Universal Resource Locator (URL) that is generally known to users of the Internet. The structural aspects of the URL that are used in certain embodiments of the Internet are described in this disclosure. While this disclosure describes the URL as providing structure to the data that is relied on in the concept network, it is emphasized that any other mechanism that can contain similar structural information that can be relied upon by a concept network is within the intended scope of the present disclosure.

In certain embodiments of the computer environment 50, a user submitting a query will result in a related concept network 100, in which the displayed results are organized into a series of related concepts. In general, a large variety of types of search results can be obtained based on a large variety of user queries. As such, the present disclosure describes the generation of a variety of concept networks based on a large number of user queries. One illustrative example of a concept network is yielded by searching for the “top N” queries (those queries that ask for the top “N” of any category where N is an integer). Another illustrative example of a concept network yields meaningful results in searches for a complex concept, such as “explain electronics.” The concept network 100 is generated based on the concepts (unlike prior-art search engines, which produce search results based on the actual keywords provided by the user's query). Concept networks 100 are generated in response to queries in a manner such that they can provide more detailed and accurate information to the user. Certain embodiments of concept networks, for example, are generated considering a considerable number of Web sites that relate to the concept posed by the query. The concept network considers the relevancy of each Web site to the concept provided by the query, and then relevant portions of a variety of Web pages are illustrated over the concept network to the user.

The concept network 100 can display results from a variety of queries in a more organized and accurate fashion than prior-art search engines that rely on keyword query results. The fact that more directed and accurate query responses are included allows the concept network to deal with fewer Web pages in the display to the user. The relatively few Web pages that are generated by the concept network can then be organized according to their structure. For example, those Web pages that relate to one type of concept can be accessed through one portion of the concept network while those Web pages that relate to another type of concept can be accessed through another portion of the concept network. The concept network 100 therefore can contain the structure information relating to a large amount of retrieved information (such as Web sites, Web site content information, or portions of Web pages).

The concept network 100 provides a number of improvements over listings of Web pages that are returned by most prior-art search engines. Certain embodiments of concept network 100 instead provide and display structured information that is arranged in an order on a Web page (that displays the concept network). The automatically generated and displayed concept network 100, as illustrated in FIG. 4, is in a form that can be readily understood and interpreted, and is more useful to the user. The concept network 100 generally improves the precision and speed of searches, and increases the relevancy of the information obtained during these searches, compared to prior-art search engines.

Concept networks 100, in general, display a considerable amount of information that is derived based on the structural information (e.g., format, links between nodes, etc.) of the data. In one embodiment, this structural information is obtained based on the Universal Resource Locator (URL), although any device that contains structural information for the retrieved information can be used. In the Internet, the URL is currently used for navigational purposes to allow browsers to access particular Web pages over the Internet. The URL can also be used to provide structural information (that describes the relationship between different nodes) that is used as described in this disclosure to create concept networks. Examples of such structural information involve, for example, one node being an ancestor, an offspring, a sibling, or some other relation to another node. Such structural information is used by a variety of embodiments of concept networks 100 to structurally describe the relationship between different nodes within the concept network.

Such structural information is used in the concept network 100 to provide a taxonomy, or classification, of words. The taxonomy of the concept network (as with prior-art search engines) relates to the meanings of particular words. Prior-art manual search engines have difficulty maintaining a current taxonomy, considering the large numbers of words that have changed meanings or are being added or removed within the search engine. Certain embodiments of the concept network provide an automatically constructed taxonomy that is adaptive to domains and users based on the structure of the Web sites accessed during the query. It is envisioned that concept networks 100, as disclosed herein, can be applied to a large variety of computer systems including, but not limited to: databases, online shopping, cameras, personal computers, handheld computers, machine learning, and computer manufacturing.

While this disclosure describes the concept network 100 being applied to analyze Web sites on the Internet, it should be emphasized that these concepts are applicable to all networked, stand-alone, and other computer-based search engines. As such, the application of the concept network to the Internet, or any network or computer system, is within the intended scope of the present disclosure.

The present disclosure describes a variety of embodiments of the concept network 100, and associated components. The concept network 100 is designed to automatically keep itself up to date without any need for updating on behalf of the user. Between queries, one embodiment of the computer environment continually searches, such as by using a Web site crawler, in a manner similar to the way keyword search engines cache popular searches. One embodiment of the concept network 100 will crawl all the Web sites relating to the collected concepts to keep the concept network up to date. The crawling process is envisioned to be similar to those processes performed by traditional search engines.

The concept network 100, within a reasonable amount of time, is able to understand a large number of keywords of typical usage (including their structure) based on the taxonomy produced with the concept network. Using the taxonomy, the concept network displays the keywords in a structured fashion. As such, the concept network is capable of being used as a thesaurus since the concept network is able to interpret the meanings of words based on the taxonomy. The increased number of words in the taxonomy (i.e., dictionaries) of the concept network is therefore especially useful to users that are searching a computer environment such as a network or the Web for a specific technical, legal, or other such specialized word.

Almost all professions have a considerable number of specialized words, many of which are being continuously updated over the years. For example, such professions and groups as attorneys, tax experts, engineers, etc. each have their own taxonomy based on their particular field of use and expertise. The manual search engines do not update many of these terms because of a relatively small number of users for each of these areas. The concept network can automatically update many of these terms that are in specialized, uncommon, or frequently updated usage.

One embodiment of the Web site search portion 201 that is used to derive the concept network 100 is described in FIG. 2. The embodiment of the Web site search portion 201 includes an entrance page and crawler rules portion 202, a Web site structure analyzer 204, a Web page summarization portion 206, a Web site structure merging tool 208, and the concept network 100. The Web site structure analyzer 204 includes a hyperlink queue 212, a Web site crawler 214, an HTML parser 216, a Function-based Object Model (FOM) analyzer 218, and a hyperlink analyzer 220.

To produce a concept network 100, the Web site structure analyzer 204 analyzes the structures of Web sites. Then a web merge tool (also referred to herein as the Web site structure merging tool 208 of FIG. 2) merges contents from different ones of the structuralized Web sites to yield the search results that can be displayed using the concept network.

Links are used to navigate in traditional Web sites. To analyze the Web site content structure in order to create each concept network 100, the link is converted from being used for navigation to being used for content. To do this conversion, the following steps are performed.

a) The structured information for each Web site is encoded in the URL. As such, a particular link is encoded in the URL irrespective of whether it is an upward link, a downward link, a sibling link, or a crosswise link. This is not done for prior-art search engines. In one embodiment, this distinguishing of the type of link is performed by the Web site crawler 214 by considering the visiting sequences of the Web site crawler.

b) An aggregation and association analysis is performed. This aggregation and association analysis includes determining the locations of the hubs and the different authorities. In one embodiment, this can be performed by the FOM Analyzer 218.

c) The information link and the navigation link are then distinguished. This identification is performed using a function-based object model (FOM) to analyze the navigation bar, the navigation list, or the independent link. As such, the page layout is used to segment the Web page. In one embodiment, c) can be performed using the FOM Analyzer 218.

While prior-art search engines provide access to multiple Web sites on a one-at-a-time basis, the concept network 100 is formed to contain structural information obtained from a variety of Web sites simultaneously. The information from this variety of Web pages can be organized in a manner on the concept network 100 that can be easily understood by the user. More particularly, similarly structured information from multiple Web sites can be displayed in the concept network 100 in a manner that presents quantifiable values from the structural information of the multiple Web pages (often based on the URL). Such structural information from the multiple Web pages can then be presented in a manner leading to comparison between the subjects of the different Web pages. For example, multiple companies or groups that deal with a particular industry or topic are likely to contain similar types of information in their Web pages in a similar structure. The concept network provides a vehicle to display this similar information from the different Web pages; or alternatively presents the different but related Web pages to the user in a manner that permits easy accessing of the different Web pages from the same concept network.

In certain embodiments, the Web site structure analyzer 204 accepts as inputs the enter-point URL of a Web site and some Web site crawler rules from the entrance page and crawler rules portion 202. The URLs contain a variety of structural information that relates to a particular Web page (e.g., end points of the links, the type of Web page, etc.). This structure provided by the URL is not utilized by traditional search engines for deriving structural information relating to the Web pages. The Web site structure analyzer 204 analyzes the Web site structure and assigns depth information to the Web pages. As a result, one embodiment of the Web site structure analyzer 204 generates a hierarchy graph of the Web site, whose nodes include concepts. The concepts derived by the concept network may be characterized by keywords as described in this disclosure. The Web site structure analyzer 204 leads to the use of a structuralized Web site.

One embodiment of the Web-Site Structure Analyzer 204 is based on the BFS (Breadth-First Search) algorithm. The Web-Site Structure Analyzer 204 maintains the Hyperlink Queue 212. The Web site crawler 214 fetches a URL from the hyperlink queue 212, crawls the Hypertext Markup Language (HTML) source code from the Internet, and then forwards the HTML source code to the HTML Parser 216. The hyperlink queue 212 is a queue of unanalyzed hyperlinks. Before the analysis begins, the Web site structure analyzer 204 adds the enter-point URL to the hyperlink queue 212. During the analysis, only the Web site crawler 214 fetches URLs from the hyperlink queue 212, and only the hyperlink analyzer 220 adds new unanalyzed hyperlinks to it.

The enter-point URL of a Web site enters the Hyperlink Queue 212 of the Web site structure analyzer 204 from the entrance page and crawler rules portion 202. When the Web-Site Structure Analyzer 204 begins its analysis, the Web site Crawler 214 fetches the URL from the Hyperlink Queue 212, then the Web site Crawler crawls the HTML source code from the Internet, and forwards the HTML source code to the HTML Parser 216. The HTML parser processes the HTML source code crawled from the Internet.

The HTML Parser 216 accepts the HTML source code that is input from the Web site crawler 214. In one embodiment, the activities of the HTML parser 216 include URL fetching, URL unification, and URL grouping. For URL fetching, the HTML parser 216 fetches all URLs that point to a Web page and are inside the Web site according to the input Web site definition. Every URL is attached with anchor text. For an image link, the anchor is the surrounding text.

For URL unification, one embodiment of the HTML parser 216 performs a variety of operations including: a) converting a relative URL address to a direct (absolute) URL address; b) changing an IP address to a domain name; and c) solving the redirected URL problem by substituting the URL with the final target URL address. For URL grouping, the hyperlinks in a table or list that have the same tag elements and same appearance are likely to be considered, for example, as related nodes. The results from the HTML parser 216 are then forwarded to the Function-based Object Model (FOM) Analyzer 218.
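The three URL unification operations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HTML parser 216 itself: the function name `unify_url` and the `ip_to_domain` and `redirects` lookup tables are hypothetical stand-ins for reverse DNS resolution and for following redirects over the network.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def unify_url(base_url, href, ip_to_domain=None, redirects=None):
    """Return a unified form of `href` as found on the page at `base_url`.

    `ip_to_domain` and `redirects` are hypothetical lookup tables that stand
    in for network lookups a real parser would perform.
    """
    ip_to_domain = ip_to_domain or {}
    redirects = redirects or {}

    # a) Convert a relative address to an absolute one.
    url = urljoin(base_url, href)

    # b) Replace an IP address host with its domain name, when known.
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in ip_to_domain:
        netloc = ip_to_domain[host]
        if parts.port:
            netloc += f":{parts.port}"
        url = urlunsplit(
            (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
        )

    # c) Follow known redirects to the final target, guarding against cycles.
    seen = set()
    while url in redirects and url not in seen:
        seen.add(url)
        url = redirects[url]
    return url
```

With such a function, two hyperlinks that are written differently but reach the same page map to the same node in the hierarchy graph.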

The Function-based Object Model (FOM) Analyzer 218 uses the basic idea and algorithm of FOM to assign functional information to hyperlinks. This functional information is very useful for analyzing the structure of each Web site. FOM represents a function-based object model for a Web page. Instead of semantic analysis, the FOM Analyzer 218 attempts to understand the authors' intention by identifying each object function and category. Each Web page can function as an index page or a content page. One category of the navigation object is the navigation bar. One embodiment of the FOM Analyzer 218 performs Index/content page recognition and Navigation bar detection as the following FOM analysis tasks.

For the Index/content page recognition, one embodiment of the FOM analyzer 218 determines whether the Web page URL includes the text “index” or “default”, and whether the URL is a directory or whether it is an index page. If there is a link inside the page that corresponds to the subdirectory, this link is to the index page. The ratio of hyperlinks to content words is compared to a threshold value. If the ratio is greater than the threshold value, the Web page is an index page. If the threshold value is greater than the ratio, the Web page is a content page.
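A minimal Python sketch of this recognition step follows. Several details are assumptions made only for illustration: the function name `classify_page`, the default threshold of 0.2, the choice to treat a URL hint as sufficient on its own, and the handling of a ratio exactly equal to the threshold (treated here as a content page), none of which is specified in this disclosure.

```python
from urllib.parse import urlparse


def classify_page(url, num_hyperlinks, num_content_words, threshold=0.2):
    """Classify a page as "index" or "content" (illustrative heuristic)."""
    path = urlparse(url).path.lower()
    last_segment = path.rsplit("/", 1)[-1]

    # URL hints: a directory URL, or a file name containing "index"/"default".
    if path == "" or path.endswith("/"):
        return "index"
    if "index" in last_segment or "default" in last_segment:
        return "index"

    # Otherwise compare the hyperlink-to-content-word ratio to the threshold.
    if num_content_words == 0:
        ratio = float("inf") if num_hyperlinks > 0 else 0.0
    else:
        ratio = num_hyperlinks / num_content_words
    return "index" if ratio > threshold else "content"
```

A page with 50 hyperlinks and 100 content words has a ratio of 0.5 and would be classified as an index page under the assumed threshold, while a page with 5 hyperlinks and 500 content words (ratio 0.01) would be classified as a content page.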

One embodiment of the FOM analyzer 218 provides navigation bar detection. The items in a navigation bar are inter-connected with each other and the corresponding link topology is a completely connected graph. The output of the FOM analyzer 218 includes a plurality of hyperlinks that are forwarded to the hyperlink analyzer 220. The FOM analyzer 218 provides a block segmentation for a Web page. In one embodiment, after segmentation, a Web page is divided into several small units based on their functions, such as content block, navigation block, advertisement block, etc. These small units can be individually accessed by the user.
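The completely connected topology of a navigation bar can be checked as sketched below in Python. The function name `is_navigation_bar` and the `links` mapping (page URL to the set of URLs it links to) are hypothetical; how candidate groups are proposed in the first place is outside this sketch.

```python
def is_navigation_bar(candidates, links):
    """Return True if every candidate page links to every other candidate.

    `links` maps a page to the set of pages it links to (assumed input).
    """
    pages = set(candidates)
    if len(pages) < 2:
        return False
    # Complete graph: each page's outgoing links cover all other candidates.
    return all((pages - {page}) <= links.get(page, set()) for page in pages)
```

Pages that pass this check can be treated as sibling nodes, which is how the navigation bar is used in the hyperlink analysis.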

One embodiment of the Hyperlink Analyzer 220 uses a Web site structure analysis algorithm to handle each hyperlink analyzed by (and transmitted from) the FOM analyzer 218. The parsed source code is forwarded to the FOM analyzer 218 to perform functional analysis. The Hyperlink Analyzer 220 analyzes each hyperlink according to Web-Site Structure Analysis Rules, and the new unanalyzed hyperlinks are inserted into the Hyperlink Queue 212. The Hyperlink Analyzer 220 assigns a depth value to each Web page (and maintains the temporary hierarchy graph of the Web site). The depth value can be outputted by the Web site crawler 214. In one embodiment, the Web site crawler 214 visits the Web site by breadth-first search. The traversal path forms a tree in which each node is a Web page and the links between the nodes are the hyperlinks within the Web pages. The depth of a node in this tree is the depth value to be obtained. For example, the depth for an entry point Web page (such as the entry point page identified by the URL http://www.microsoft.com) is 0. The depth for the Web page identified by the URL http://www.microsoft.com/china, by comparison, is 1.

The Web site structure analyzer 204 forms a loop that can be considered as starting and ending at the hyperlink queue 212. The Web site Crawler 214 fetches the next URL from the hyperlink queue 212 to begin the next loop. This is performed until the hyperlink queue 212 is empty of new URLs. The analysis process is then complete, and the hierarchy graph of the Web site (called the structuralized Web site) is constructed.
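The queue-driven loop and the depth assignment can be sketched in Python as a breadth-first traversal. This is an illustration only: `fetch_links` is a hypothetical callable standing in for the crawler, parser, and analyzers (it returns the in-site hyperlinks found on a page), and the URLs in the usage note are placeholders.

```python
from collections import deque


def analyze_site(entry_url, fetch_links):
    """Breadth-first traversal that assigns a depth to each reachable page.

    `fetch_links(url)` is an assumed helper returning the hyperlinks on `url`.
    """
    depth = {entry_url: 0}          # the enter-point page has depth 0
    queue = deque([entry_url])      # plays the role of the hyperlink queue
    while queue:                    # loop until no unanalyzed URLs remain
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in depth:   # only new, unanalyzed hyperlinks are queued
                depth[link] = depth[url] + 1
                queue.append(link)
    return depth
```

For instance, with an in-memory `site` dictionary in place of a live crawl, `analyze_site("http://example.com", lambda u: site.get(u, []))` returns a mapping from each reachable URL to its depth in the traversal tree.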

The structured information for each Web site is encoded in the URL in a manner that can be detected using the hyperlink analyzer 220. As such, whether a particular link is an upward link, a downward link, a sibling link, or a crosswise link is encoded in the URL (and can be detected using the hyperlink analyzer 220). In one embodiment, a heuristic rule based on URL block-length is used to detect an Upward-link and a Forward-link. A URL block-length is defined as a number of blocks, where a block is a part of the URL separated by “/” or “?”. For instance, the URL block-length of the URL “http://www.sonystyle.com/digital/digital_camera.htm” is 3, including “http://www.sonystyle.com”, “digital”, and “digital_camera.htm”. In one embodiment, the restricted rules are applied first to analyze the URLs. The above strategies are then used to analyze the remaining URLs that are not covered by the rules. One embodiment of a Hyperlink Detecting Rule is described according to two rules. The first rule is that if the URL block-length of the hyperlink is less than or equal to the URL block-length of the Web page, then the hyperlink is an Upward-link. The second rule is that if the URL block-length of the hyperlink minus the URL block-length of the Web page is greater than or equal to 2, then the hyperlink is a Forward-link.
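The block-length heuristic and the two detecting rules can be sketched in Python as follows. The function names are hypothetical, and counting the scheme and host together as a single block is an assumption chosen so that the example URL above yields a block-length of 3. A difference of exactly 1 is not covered by either rule, so the sketch returns `None` in that case.

```python
import re


def url_block_length(url):
    """Count the blocks of a URL, splitting on "/" and "?".

    The scheme and host ("http://host") are counted as one block.
    """
    rest = url.split("://", 1)[-1]
    return len([block for block in re.split(r"[/?]", rest) if block])


def detect_link_type(page_url, hyperlink_url):
    """Apply the two block-length rules; return None if neither applies."""
    difference = url_block_length(hyperlink_url) - url_block_length(page_url)
    if difference <= 0:
        return "upward"
    if difference >= 2:
        return "forward"
    return None
```

Under these assumptions, a hyperlink from “http://www.sonystyle.com/digital/digital_camera.htm” back to “http://www.sonystyle.com/digital” is detected as an Upward-link, since its block-length (2) does not exceed that of the page (3).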

Suppose the current Web page node is B, which has a hyperlink to Web page C. The hyperlink analyzer portion 220 of the Web Site Structure Analyzer 204 follows this process:

I. If the hyperlink is an upward link, it is dropped (not further considered).
II. If B and C belong to a navigation bar, then B and C are sibling nodes (as described herein).
III. If C has been visited and the URL block-length of B is greater than or equal to that of C:
    If B is an index page, then C is B's child node (as described herein);
    Else if B is a content page, then C is B's sibling node.
IV. If C has not been visited:
    If B is a content page, then C is B's sibling node;
    Else C is B's child node.
    Else if C hasn't been accessed, then:
        First, if B is a content page or displayed in several pages, the link is an explicit association;
        Otherwise the link is an aggregation.
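Steps I through IV of this process can be sketched in Python as a simple decision function. The function name, its keyword parameters, and the returned labels are all hypothetical, and the inputs are assumed to have been computed by the earlier analysis stages. The final association/aggregation branch is omitted because its triggering condition is not fully specified above; links that match none of the implemented steps are returned as "unclassified".

```python
def classify_link(*, is_upward, in_same_nav_bar, c_visited,
                  b_is_index, block_len_b, block_len_c):
    """Classify the hyperlink from page B to page C (illustrative sketch)."""
    # I. Upward links are dropped.
    if is_upward:
        return "dropped"
    # II. Pages in the same navigation bar are siblings.
    if in_same_nav_bar:
        return "sibling"
    # III. C already visited and B's block-length >= C's block-length.
    if c_visited and block_len_b >= block_len_c:
        return "child" if b_is_index else "sibling"
    # IV. C not yet visited.
    if not c_visited:
        return "child" if b_is_index else "sibling"
    # Not covered by the steps implemented here.
    return "unclassified"
```

Here a page that is not an index page is treated as a content page, consistent with the statement that each Web page functions as one or the other.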

After analyzing the URL in the hyperlink queue, the Web site structure is derived using the Web page summarization portion 206. For instance, a certain amount of data contained in a Web page may be relevant to a particular user's query, while other data is not relevant. The Web page summarization provides the relevant information in a form that can be displayed over a particular concept section within the concept network 100. Since the entirety of each Web page is not illustrated over the concept network, the concept network can provide a more directed summary of the information of each concept or Web page that can be accessed by the user. The varied contents of the different Web pages (or other contents) that are derived from the Web page summarization portion 206 are then merged into the concept network 100 using the Web site structure merging tool 208. The Web site structure is represented with a hierarchy graph.

Certain embodiments of the concept network 100 analyze the structure of relevant Web sites, and thereupon merge the results together. This merging of the information from a plurality of Web sites is referred to in this disclosure as a web merge as performed by the Web site Structure Merging Tool 208 as illustrated in FIG. 2. The web merge performed by the Web site Structure Merging Tool 208 improves the precision and speed of the concept network and is performed as follows.

After each Web site is structuralized into the “tree-like graph” or “depth-level graph”, the next problem is to merge these graphs into a network. In the network, each node represents one concept and the links between these nodes represent the relationships between these concepts. The basic relationships may include, but are not limited to, hypernyms, hyponyms, synonyms, etc. Since each Web site represents the original editor's view on the related topics, it is somewhat difficult to merge the different views into one view. The following therefore gives a solution for merging the concept hierarchy from all kinds of sources into one usable hierarchy.

To illustrate one embodiment of how to merge the hierarchy of the concept network, one kind of relationship R for a given concept C is merged from two different hierarchies H. A detailed algorithm to solve this problem follows:

The following technique represents one embodiment that can be used to perform the ontology merging procedure:

-   -   a) For each Web block, the concepts are summarized for a Web page using the Web page summarization portion 206 as shown in FIG. 2. The concepts are interpreted as a set of keywords.

-   -   b) The concepts are then tokenized, whereby each concept that is to be generated and displayed over the concept network 100 is represented by a “token” phrase or keyword. As such, a set of keywords is established to represent and describe the concepts contained in the concept network. Equation (1) is used to eventually yield the concepts:

$n_{i} = [w_{i1}, w_{i2}, \ldots, w_{im}]$  (1)

where w_(i1), w_(i2), . . . , w_(im) represent words, and n_(i) represents the array of words. That is, n_(i) is the summary for a node (Web page) in the concept network, and it can be decomposed into several words or phrases, i.e., w_(i1), w_(i2), . . . , w_(im).

-   -   c) A gliding window is provided over the hierarchy tree to generate the sub-tree ST of the offspring, ancestor, and sibling respectively using (2), (3), and (4). It is assumed that some words appear in different windows.

$ST_{i}(\text{offspring}) = (n_{i}, \text{sons}_{1}(n_{i}), \ldots, \text{sons}_{d}(n_{i}))$  (2)

$ST_{i}(\text{ancestor}) = (n_{i}, \text{parents}_{1}(n_{i}), \ldots, \text{parents}_{d}(n_{i}))$  (3)

$ST_{i}(\text{sibling}) = (n_{i}, \text{sibs}_{1}(n_{i}), \ldots, \text{sibs}_{d}(n_{i}))$  (4)

where ST_(i)(offspring), ST_(i)(ancestor), and ST_(i)(sibling) are the sub-trees used for calculating the offspring, ancestor, and sibling relationships, and sons_(d), parents_(d), and sibs_(d) stand for the d^(th)-level son nodes, parent nodes, and sibling nodes of node n_(i), respectively.
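The sub-tree windows of (2), (3), and (4) can be sketched in Python as follows. This is a minimal illustration and not the disclosed implementation: the helper names (`build_maps`, `offspring_subtree`, `ancestor_subtree`, `sibling_subtree`) are hypothetical, the hierarchy is assumed to be a tree given as (parent, child) pairs, and the d^(th)-level siblings are assumed to be the nodes at the same depth that share the d^(th)-level ancestor.

```python
from collections import defaultdict


def build_maps(edges):
    """Build child and parent lookups from (parent, child) pairs of one tree."""
    children = defaultdict(list)
    parent = {}
    for p, c in edges:
        children[p].append(c)
        parent[c] = p
    return children, parent


def offspring_subtree(node, children, depth):
    """ST(offspring): the node followed by its descendants, level by level."""
    result, frontier = [node], [node]
    for _ in range(depth):
        frontier = [c for n in frontier for c in children.get(n, [])]
        result.extend(frontier)
    return result


def ancestor_subtree(node, parent, depth):
    """ST(ancestor): the node followed by up to `depth` successive parents."""
    result, current = [node], node
    for _ in range(depth):
        if current not in parent:
            break  # reached the root
        current = parent[current]
        result.append(current)
    return result


def sibling_subtree(node, children, parent, depth):
    """ST(sibling): the node followed by same-depth nodes that share an
    ancestor at most `depth` levels up (an assumed reading of sibs_d)."""
    result, seen, ancestor = [node], {node}, node
    for level in range(1, depth + 1):
        if ancestor not in parent:
            break
        ancestor = parent[ancestor]
        frontier = [ancestor]
        for _ in range(level):  # walk back down to the node's own depth
            frontier = [c for n in frontier for c in children.get(n, [])]
        for n in frontier:
            if n not in seen:
                seen.add(n)
                result.append(n)
    return result
```

Each returned list corresponds to one window; the words summarizing the listed nodes would then feed the term-pair statistics of the following steps.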

-   -   d) For each generated sub-tree (e.g., ST_(i)(ancestor)), the mutual information MI of each word pair w_(i), w_(j) is calculated according to Equation (5). A high mutual information value indicates that the pair of words are similar.

$\begin{matrix}
{MI(w_{i},w_{j}) = P_{r}(w_{i},w_{j}) \log \frac{P_{r}(w_{i},w_{j})}{P_{r}(w_{i}) P_{r}(w_{j})}} & (5) \\
{P_{r}(w_{i},w_{j}) = \frac{C(w_{i},w_{j})}{\sum\limits_{k} \sum\limits_{l} C(w_{k},w_{l})}} & (6) \\
{P_{r}(w_{i}) = \frac{C(w_{i})}{\sum\limits_{k} C(w_{k})}} & (7) \\
{P_{r}(w_{j}) = \frac{C(w_{j})}{\sum\limits_{k} C(w_{k})}} & (8)
\end{matrix}$

where MI(w_(i), w_(j)) is the mutual information of terms w_(i) and w_(j); Pr(w_(i), w_(j)) stands for the probability that terms w_(i) and w_(j) appear together in the sub-tree; and Pr(x) (where x can be w_(i) or w_(j)) stands for the probability that term x appears in the sub-tree.
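Equations (5) through (8) can be illustrated with a short Python sketch. It rests on assumptions the text does not spell out: C(w_(i), w_(j)) is taken to be the number of sub-trees in which both words occur, C(w) the number of sub-trees in which w occurs, and pairs are treated as unordered. The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(subtrees):
    """Return {(w_i, w_j): MI} for word lists, one list per sub-tree.

    Counts are per sub-tree (a word or pair is counted once per window),
    which is an assumed reading of C(.) in Equations (6) through (8).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in subtrees:
        unique = sorted(set(words))
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    mi = {}
    for (wi, wj), count in pair_counts.items():
        p_ij = count / total_pairs             # Equation (6)
        p_i = word_counts[wi] / total_words    # Equation (7)
        p_j = word_counts[wj] / total_words    # Equation (8)
        mi[(wi, wj)] = p_ij * math.log(p_ij / (p_i * p_j))  # Equation (5)
    return mi
```

For example, with the windows `[["camera", "lens"], ["camera", "lens"], ["camera", "tripod"]]`, the pair ("camera", "lens") receives a higher score than ("camera", "tripod") because it co-occurs in more windows.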

Another factor in determining the relevance of a pair of terms is the distribution of the term pair. The more sub-trees that contain the term pair, the more similar the two terms are. In this implementation, entropy is used to measure the distribution of the term pair, as shown in step (e).

-   -   e) Calculate the entropy for each word pair w_(i), w_(j). The entropy is a measure of whether a pair of words w_(i), w_(j) that was determined to be common, based on the mutual information determined in (5), is actually common across all of the Web sites. The higher the entropy, the more confidence the concept network can provide to the user that the word pair is common among all of the Web sites.

$\begin{matrix}
{\text{entropy}(w_{i},w_{j}) = - \sum\limits_{k = 1}^{N} p_{k}(w_{i},w_{j}) \log p_{k}(w_{i},w_{j})} & (9) \\
{p_{k}(w_{i},w_{j}) = \frac{C(w_{i},w_{j} \mid ST_{k})}{\sum\limits_{l = 1}^{N} C(w_{i},w_{j} \mid ST_{l})}} & (10)
\end{matrix}$

-   -   f) Calculate the similarity Sim for each word pair as per (11):

$\begin{matrix}
{\text{Sim}(w_{i},w_{j}) = MI(w_{i},w_{j}) \times \frac{\text{entropy}(w_{i},w_{j}) + 1}{\alpha \log(N)}} & (11)
\end{matrix}$

The similarity as set forth in (11) combines the mutual information MI(w_(i), w_(j)) and entropy(w_(i), w_(j)).
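The entropy and similarity calculations can be sketched as follows. This is an illustrative sketch under stated assumptions: N is taken to be the number of sub-trees, the input list holds the count of the word pair in each sub-tree, α is treated as a tuning constant (the text does not give its value, so the default of 1.0 is arbitrary), and the function names are hypothetical.

```python
import math


def pair_entropy(counts_per_subtree):
    """Entropy of a word pair's distribution over N sub-trees.

    counts_per_subtree[k] is the count of the pair in sub-tree k; each
    count is normalized to a share p_k, and empty sub-trees contribute 0.
    """
    total = sum(counts_per_subtree)
    if total == 0:
        return 0.0
    shares = [c / total for c in counts_per_subtree if c > 0]
    return -sum(p * math.log(p) for p in shares)


def similarity(mi, counts_per_subtree, alpha=1.0):
    """Combine mutual information and entropy as in Equation (11)."""
    n = len(counts_per_subtree)
    if n < 2:
        raise ValueError("at least two sub-trees are needed so that log(N) > 0")
    return mi * (pair_entropy(counts_per_subtree) + 1) / (alpha * math.log(n))
```

With the same mutual information, a pair spread evenly over four sub-trees (`[1, 1, 1, 1]`) scores higher than a pair concentrated in one (`[4, 0, 0, 0]`), which matches the stated intent that widely distributed pairs are treated as more similar.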

To indicate the related concepts (offspring, ancestor, and sibling) that relate to (2), (3), and (4), the concept network produces a variety of related categories. For instance, Table 1 illustrates a variety of exemplary offspring concepts for well-known concepts:

TABLE 1 Offspring Concepts

Original Category    Offspring
Software             utility, game, business, general, graphic, database
Video                DVD, TV, projection, camcorder
Fiction              story, drama, horror, poetry, science, romance
Apparel              clothing, women's, shirt, shoe, accessory, men's, sport, costume, children's
Shoe                 boot, heel, sandal, slipper, casual
Pet                  care, supply, bird, cat, dog, fish, food, service

Table 2 illustrates a variety of exemplary Ancestor concepts:

TABLE 2 Ancestor Concepts

Original Category    Ancestor
Software             Computer
Video                electronics, component
Fiction              book, literature
Apparel              Not applicable
Shoe                 women's, man's, apparel
Pet                  Not applicable

Table 3 illustrates a variety of exemplary Sibling concepts:

TABLE 3 Sibling Concepts

Original Category    Sibling
Software             hardware, network, apparel, storage, peripheral, memory
Video                audio, photography, camera, accessory
Fiction              cookery, history, sport, travel, author, comic
Apparel              Fashion, software, beauty, music, pet
Shoe                 Clothing, watch, coat, shirt, swimwear, pant
Pet                  Gift, sports, toy, jewelry, book

One embodiment of the concept network 100 as illustrated in FIG. 2 is provided as a directed graph that is illustrated in structural form in FIG. 3, and in a form that it may appear to a user in FIG. 4. The directed graph (G) 300 that the concept network is based on is described by (12):

$G = (V, E)$  (12)

where V is a collection of nodes and E is a collection of edges or links. As such, the concept network 100 as represented by the directed graph includes a plurality of nodes and a plurality of links or edges connecting the nodes. The nodes represent concepts. The edges or links represent the relationships between the concepts. The directed graph 300 as shown in FIG. 4 of the concept network 100 thereby provides the content structure. The content structure of the Web pages is mined to yield information that is used to produce the concept network.
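One possible in-memory form of the directed graph of (12) is sketched below. The class name `ConceptNetwork`, its fields, and the relation labels are assumptions made for illustration; the disclosure does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNetwork:
    """Directed graph G = (V, E): nodes are concepts, edges are relationships."""
    nodes: set = field(default_factory=set)    # V
    edges: dict = field(default_factory=dict)  # E: (source, target) -> relation

    def add_edge(self, source, target, relation):
        """Add a labeled directed link, registering both end nodes."""
        self.nodes.update((source, target))
        self.edges[(source, target)] = relation

    def neighbors(self, concept):
        """Concepts pointed to by, or pointing to, the given concept."""
        outgoing = {t for (s, t) in self.edges if s == concept}
        incoming = {s for (s, t) in self.edges if t == concept}
        return outgoing | incoming


network = ConceptNetwork()
network.add_edge("electronics", "camera", "offspring")
network.add_edge("camera", "electronics", "ancestor")
network.add_edge("camera", "audio", "sibling")
```

Storing the relation on each edge keeps the offspring, ancestor, and sibling links distinguishable when sub-graphs are later extracted.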

FIG. 3 illustrates one embodiment of a technique for constructing a taxonomy for a particular domain using the concept network 100. FIG. 3 starts with a derivation of one or more domain specific Web sites 302. This can be accomplished by leveraging an existing meta-search engine. For example, if a user desires to construct the concept network for the “digital camera” domain, the user can send the query to a search engine and use the top 100 Web sites to construct the concept network. Each Domain Specific Web site 302 includes structure corresponding to an analysis of content (represented by the nodes) and an analysis of the link structure (represented by the links).

Producing the concept network 100 relies on efficient mining of the content structure of one or more Web sites. This mining can be performed by analyzing the link type, which determines whether the link is an offspring link, an ancestor link, or a sibling link, such as described relative to the hyperlink analyzer 220 of FIG. 2. One of these link types is assigned to each link. The semantics of each node are then summarized using the Web page summarization portion 206 as shown in FIG. 2. In FIG. 3, the domain specific taxonomy is derived based on this information mining. Note that the derivation of the domain specific taxonomy is performed automatically in the present disclosure, in contrast to such prior-art tools as WordNet® that require manual editor input for the taxonomy. WordNet is a manually constructed taxonomy for the general domain, and is constructed by editors instead of end users. The information mining relies on the link structure and the content of the domain specific Web sites. This differs from certain prior-art automatic thesaurus constructions in which the information is mined from the content instead of the link structure.
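As a rough illustration of assigning a link type from URL structure, the following Python sketch compares the directory portions of the source and target URLs. The rule set is an assumption made for this example (a deeper directory is treated as offspring, a shallower one as ancestor, and directories under a common parent as siblings); it is not a description of the hyperlink analyzer 220, and the function names and `example.com` URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def directory(url):
    """Return the directory part of a URL path, always ending in '/'."""
    path = urlparse(url).path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)  # drop the file name
    return path.rstrip("/") + "/"


def link_type(source_url, target_url):
    """Classify a hyperlink by comparing URL directories (assumed rules)."""
    if urlparse(source_url).netloc != urlparse(target_url).netloc:
        return "external"
    src, dst = directory(source_url), directory(target_url)
    if dst != src and dst.startswith(src):
        return "offspring"  # target lies below the source directory
    if dst != src and src.startswith(dst):
        return "ancestor"   # target lies above the source directory
    if posixpath.dirname(src.rstrip("/")) == posixpath.dirname(dst.rstrip("/")):
        return "sibling"    # same directory, or directories sharing a parent
    return "other"


print(link_type("http://example.com/electronics/index.html",
                "http://example.com/electronics/cameras/index.html"))
```

Under these assumed rules, the printed classification for the example pair is an offspring link, since the target directory extends the source directory.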

The concept network 100 is then constructed using ontology learning. Based on the ontology learning, the automatically constructed concept network develops its own taxonomy. The ontology learning is based on a statistical framework, and is capable of yielding multiple editors' views. The statistical framework is easily applied to many statistical applications. The concept network 100 that is constructed as shown in FIG. 3 describes a variety of concept networks for electronics. The concept network 100 includes a variety of Web blocks 450, with each Web block representing a different category of electronics (e.g., an electronics product, an electronics category, an electronic device manufacturer, etc.).

Each Web block is described by a keyword which is recognizable by users. Each sub Web block 454 can be considered as related to the primary Web block. For example, in FIG. 3, the word “electronics” represents the primary Web block 452. The term “electronics” represents a good primary Web block 452 because this term appears in many Web sites pertaining to a variety of products (each of the variety of products may be considered as a sub Web block). For example, in FIG. 3, a variety of sub Web blocks 454 (including cameras and photos, audio and video, handheld, cell phones, computers, Sony®, iPAQ®, Palm®, Accessories, and a variety of Compaq® products) are illustrated under the electronics primary Web block. Each Web block is considered to be a concept that contains homogenous information within this disclosure. The term “concept network” therefore describes a network of multiple concepts, or Web blocks.

Each Web block can be summarized by a keyword (such as camera, computers, and “Sony” as illustrated in FIG. 3). The subject of each sub Web block in FIG. 3 relates heavily to the primary Web block electronics, and can therefore be classified broadly under the concept “electronics”. Based on the structure of the Web blocks, the mining, and the domain specific taxonomy of the concept network 100, the concept network for electronics as illustrated in FIG. 3 contains many of these terms. The yielded concept network 100 illustrated in FIG. 3 may be considered as the end result that is automatically constructed.

One embodiment of an illustrative concept network 100, as it would appear on a computer display 200 such as a flat panel display or a CRT monitor, is illustrated in FIG. 4. As such, FIG. 4 illustrates the generated concept network 100 (using the techniques illustrated in FIGS. 2 and 3) including a variety of concepts 402. Each concept 402 includes information pertaining to at least one of the Web blocks 450 that has been generated in the manner illustrated, for certain embodiments, in FIG. 3. The concept network 100 illustrated in FIG. 4 therefore contains a number of concepts 402 that are tiled on the display. The concept network is relatively detailed as to the area of interest (in this instance “electronics”). For instance, several of the concepts, if selected by a user, bring the user to another concept network that may be narrower or broader than the concept network currently displayed. For instance, a user can transfer from an electronics concept network to a computer concept network.

An analysis of the concept network was performed by searching through a variety of Web sites. The analysis indicated an improvement (to 75 percent) in the percentage of Web sites that were correctly located in certain implementations of the concept network. This represents a considerable improvement over the prior art as far as accuracy is concerned.

Consider the illustrative query “digital camera manufacturer”. Typical prior-art search engines search the entire web and return those Web pages that contain the key terms “digital” and/or “camera” and/or “manufacturer”. Such prior-art search engines will therefore return a considerable number of unrelated Web pages.

The concept network 100 needs only to search a sub-graph that expands from the node “digital camera”. As such, the concept network is faster, and the number of returned irrelevant Web pages is reduced considerably.

The concept network 100 increases the ease, speed, and reliability of the desired response to the query. First, the term “digital camera” is located in the concept network 100. All the nodes that are pointed to or pointed from the node “digital camera” are extracted. Then the nodes whose properties are “manufacturer” are selected and ranked (e.g., based on hit numbers). As such, a query for the top N of any category (largest company, largest producer, most offices, nearest locations, etc.) of Web pages can be searched, and the probability of netting a reasonable number of accurate hits is greatly increased.
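The three steps described above (locate the node, extract its neighbors, then select and rank by property) can be sketched as follows. The function `top_n`, the dictionaries holding node properties and hit numbers, and the placeholder node names are all assumptions for illustration; the data is invented and does not reflect real manufacturers.

```python
def top_n(edges, properties, hits, concept, wanted_property, n):
    """Return the n highest-ranked neighbors of `concept` having a property.

    edges:      iterable of (source, target) links in the concept network
    properties: node -> property label (e.g. "manufacturer")
    hits:       node -> hit number used for ranking
    """
    # Nodes pointed to by, or pointing to, the concept node.
    neighbors = {t for s, t in edges if s == concept}
    neighbors |= {s for s, t in edges if t == concept}
    # Keep only neighbors with the wanted property.
    matches = [v for v in neighbors if properties.get(v) == wanted_property]
    # Rank by hit number, highest first; ties are broken by name.
    return sorted(matches, key=lambda v: (-hits.get(v, 0), v))[:n]


edges = [("digital camera", "maker_a"), ("digital camera", "maker_b"),
         ("maker_c", "digital camera"), ("digital camera", "tripod")]
properties = {"maker_a": "manufacturer", "maker_b": "manufacturer",
              "maker_c": "manufacturer", "tripod": "accessory"}
hits = {"maker_a": 120, "maker_b": 300, "maker_c": 75, "tripod": 500}

print(top_n(edges, properties, hits, "digital camera", "manufacturer", 2))
```

Only the sub-graph around the query node is examined, which is the source of the speed advantage the text attributes to the concept network.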

Such improved searching characteristics of the concept network occur because the query is directed at the structure of the searched Web sites (as contained within the URL). Certain embodiments of the concept network 100 as described relative to FIG. 5 can provide a variety of search services that can search for some quantifiable parameter of the top “N” (where “N” is some positive number) organizations, companies, items, groups, products, etc. as listed on Web sites on the Internet. For example, certain embodiments produce a concept network 100 that provides a search result for a query to find the top five digital camera manufacturers in the world. Another example provides the search results for another complex query, such as one asking for the top five steel producing companies in Europe. One type of query for which the concept network is intended to be highly beneficial relies on accessing data based on the structure of the Web sites (e.g., based on the structure provided by the URL). The “top N” type query analyzes and returns information based on the structure of multiple Web sites. For example, one technique to determine the top three automobile producers in the United States involves accessing the Web sites of all of the potential automobile producers, deriving similar production information from each Web site, and then comparing the derived production information from the different Web sites. As such, certain embodiments of the concept network 100 can search for detailed features within a Web page.

Data mining is directed to such Web site analysis. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information to the user based on the query. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases, and is generally well known in queries. As such, certain embodiments of the concept network can use data mining 306 as provided by FIG. 3 to derive a domain specific taxonomy 304.

FIG. 5 illustrates one embodiment of a process 600 that results in producing the concept network. The process 600 includes 602, in which a user inputs a query into the computer environment 50 (as shown in FIG. 1). The query will result in the concept network being produced and displayed to the user. In 604, the query is submitted to a plurality of domain-specific Web sites 302 as described relative to FIG. 3. These Web sites are returned by a popular meta-search engine or a human-built web hierarchy. In 606, the computer environment analyzes the Web site structure, such as by considering the URLs of the associated Web sites. In 608, information is mined based on the structure and content of the Web sites. The information mined is used to produce the domain specific taxonomy in 610 (as described relative to 304 in FIG. 3). The process 600 continues to 612, in which the concept network 100 is produced and displayed to the user.

The concept network 100 is capable of being produced to return accurate responses to such queries as “explain the word: electronics” (which prior-art search engines cannot perform). Such concept networks 100 are also generated (as is the case in the FIG. 5 query) by analyzing the structure of the various Web sites and Web pages. One embodiment of the concept network saves the structure information of Web sites, which represents the editor's view on the hierarchy of the concept. In the concept network 100, different editors' views are merged together, so users can determine the most popular explanation.

Certain other embodiments of the concept network 100 can respond to a query that requires determining the best Web sites for performing a task such as explaining the word “electronics”. This type of query can be considered a query to explain and/or compare. As such, a number of Web sites have to be evaluated and compared by the concept network. One mechanism for producing such concept networks (such as a concept network that can explain complex issues) involves considering a large number of Web sites that relate to the issue posed by the query; measuring the relevancy of each Web site, as is performed by prior-art search engines; and then displaying relevant portions of the Web pages to the user of the concept network. The FIG. 5 embodiment of process 600 can be used to perform this type of query as well.

To respond to these types of relatively complex queries (either the top-N type query or a query that has to evaluate and compare multiple Web sites, etc.), the concept network 100 is constructed by evaluating the structure of each Web page or Web site considered. The prior-art search engines are not capable of deriving the structure from the Web sites in order to perform these analyses (and therefore are not able to respond to such queries). For example, relative to the electronics example, the concept network considers those Web pages that are structured to provide enough information that is sufficiently directed at accurately describing the electronics topic.

The concept network 100 is also very useful in query expansion. Currently, many Internet applications utilize the prior-art manual tool WordNet to expand the user's query to improve the precision of existing search engines. WordNet, however, is built manually, and constructing such a thesaurus by hand is a labor-intensive job. Almost no Web site prefers to construct the thesaurus by hand; Web site operators prefer to automate thesaurus construction. Manual thesaurus construction is not suited to the rapid growth of the Internet. The number of documents in such networks as the Internet continues to grow, and more and more new words and concepts continue to appear, which highlights the usefulness of the concept network as described in the present disclosure. The concept network returns fewer, but more directed, results as compared to the prior-art search engines that rely on keyword comparison. As such, it becomes easier for a user to evaluate each result returned by the concept network. In addition, it becomes easier for a user to evaluate whether a query is not returning the desired types of results, and therefore the user will be able to modify the original query to be more directed.

A live thesaurus (which the concept network can function as) is useful for Internet and other network searches. Moreover, the concept network 100 not only contains the hierarchy of the concepts, but also contains the statistical information for these concepts. It can therefore be easily applied to some specific questions about popularity, such as a survey.

Since one embodiment of the concept network 100 merges the views on words and concepts from all the authors for the Internet and other network environments, the concept network 100 may be considered as providing an alternative thesaurus for the network users. The concept network 100 can be adapted to the client side as a personal thesaurus. The user's browsing paths will generate a sub-space of the Web. A similar method can be applied to analyze this sub-space of the Web to generate the relationships among the concepts that the user frequently uses.

The concept network therefore provides for summarization for a Web page. Text over the hyperlink and the page title can be used as the summarization of a Web page. In another embodiment, a natural language parse (NLP) technique can be integrated in the Web site search portion 201 (perhaps as a portion of the HTML parser 216) to summarize the document using some dominant keywords.
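A minimal sketch of collecting the page title and the text over each hyperlink is shown below, using Python's standard `html.parser` module. The class and function names are hypothetical, the sample HTML is invented, and the sketch makes no claim about how the HTML parser 216 operates.

```python
from html.parser import HTMLParser


class LinkAndTitleSummary(HTMLParser):
    """Collect the <title> text and (href, anchor text) pairs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.anchors = []        # list of (href, anchor text)
        self._in_title = False
        self._href = None        # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def summarize(html):
    """Return (title, [(href, anchor text), ...]) for an HTML string."""
    parser = LinkAndTitleSummary()
    parser.feed(html)
    return parser.title.strip(), parser.anchors
```

The anchor text describes the target page from the linking editor's point of view, while the title describes the page from its own author's point of view; either can serve as the keyword summary for a node.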

This disclosure describes a variety of the concept networks 100. The concept network can be considered as an Internet concept network built from Web sites by analyzing the structure of a plurality of the Web sites and merging the analysis results. The concept network 100 may be especially useful for improving the precision and speed of a search engine. The concept network extracts knowledge from Web site structure rather than solely from plain text contained within the Web site. The concept network provides automatic construction for a domain. The statistical results from a concept network reveal the general knowledge contained in a variety of Web sites.

As such, the concept network not only obtains information from a particular Web site, but obtains knowledge from a large variety of Web sites over the network. The concept network can use ontology learning to maintain the structure information relating to the Web sites. Therefore, as new Web pages and concepts are added to the Internet, ontology learning allows the structural information from the Web pages to be automatically integrated into the concept network. Furthermore, the concept network 100 can provide some services that common search engines cannot, such as “find out the Top N digital camera manufacturers in the world” and “explain the word: electronics”. The concept network can also function as a live Internet thesaurus for query expansion since it provides such a variety of sub Web blocks that are related to each other through a primary Web block as illustrated in FIG. 3.

FIG. 6 illustrates an example of a suitable computer environment or network 500 which includes a user interface that can produce a concept network. The computer environment 500 represents one embodiment of the computer environment 50 illustrated in FIG. 1. Similar resources may use the computer environment and the processes described herein.

The computer environment 500 illustrated in FIG. 6 is a general computer environment, which can be used to implement the concept network techniques described herein. The computer environment 500 is only one example of a computer environment and is not intended to suggest any limitation as to the scope of use or functionality of the computer and network architectures. Neither should the computer environment 500 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computer environment 500.

The computer environment 500 includes a general-purpose computing device in the form of a computer 502. The computer 502 can include, for example, one or more from the group including a stand alone computer, a networked computer, a mainframe computer, a PDA, a telephone, a microcomputer or microprocessor, or any other computer device that uses a processor in combination with a memory. The components of the computer 502 can include, but are not limited to, one or more processors or processing units 504 (optionally including a cryptographic processor or co-processor), a system memory 506, and a system bus 508 that couples various system components including the processor 504 and the system memory 506.

The system bus 508 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus also known as a Mezzanine bus.

The computer 502 typically includes a variety of computer readable media. Such media can be any available media that is accessible by the computer 502 and includes both volatile and non-volatile media, and removable and non-removable media.

The system memory 506 includes the computer readable media in the form of non-volatile memory such as read only memory (ROM) 512, and/or volatile memory such as random access memory (RAM) 510. A basic input/output system (BIOS) 514, containing the basic routines that help to transfer information between elements within the computer 502, such as during start-up, is stored in the ROM 512. The RAM 510 typically contains data and/or program modules that are immediately accessible to, and/or presently operated on, by the processing unit 504.

The computer 502 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example, FIG. 6 illustrates a hard disk drive 515 for reading from and writing to a non-removable, non-volatile magnetic media (not shown), a magnetic disk drive 518 for reading from and writing to a removable, non-volatile magnetic disk 520 (e.g., a “floppy disk”), and an optical disk drive 522 for reading from and/or writing to a removable, non-volatile optical disk 524 such as a CD-ROM, DVD-ROM, or other optical media. The hard disk drive 515, magnetic disk drive 518, and optical disk drive 522 are each connected to the system bus 508 by one or more data media interfaces 527. Alternatively, the hard disk drive 515, magnetic disk drive 518, and optical disk drive 522 can be connected to the system bus 508 by one or more interfaces (not shown).

The disk drives and their associated computer-readable media provide non-volatile storage of computer readable instructions, control node data structures, program modules, and other data for the computer 502. Although the example illustrates a hard disk within the hard disk drive 515, a removable magnetic disk 520, and a non-volatile optical disk 524, it is to be appreciated that other types of the computer readable media which can store data that is accessible by a computer, such as magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like, can also be utilized to implement the exemplary computer environment 500.

Any number of program modules can be stored on the hard disk contained in the hard disk drive 515, magnetic disk 520, non-volatile optical disk 524, ROM 512, and/or RAM 510, including by way of example, the OS 526, one or more application programs 528, other program modules 530, and program data 532. Each OS 526, one or more application programs 528, other program modules 530, and program data 532 (or some combination thereof) may implement all or part of the resident components that support the distributed file system.

A user can enter commands and information into the computer 502 via input devices such as a keyboard 534 and a pointing device 536 (e.g., a “mouse”). Other input devices 538 (not shown specifically) may include a microphone, joystick, game pad, satellite dish, serial port, scanner, and/or the like. These and other input devices are connected to the processing unit 504 via input/output interfaces 540 that are coupled to the system bus 508, but may be connected by other interface and bus structures, such as a parallel port, game port, or a universal serial bus (USB).

A monitor, flat panel display, or other type of a computer display 200 can also be connected to the system bus 508 via an interface, such as a video adapter 544. In addition to the computer display 200, other output peripheral devices can include components such as speakers (not shown) and a printer 546 which can be connected to the computer 502 via the input/output interfaces 540.

The computer 502 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computer device 548. By way of example, the remote computer device 548 can be a personal computer, portable computer, a server, a router, a network computer, a peer device or other common network node, game console, and the like. The remote computer device 548 is illustrated as a portable computer that can include many or all of the elements and features described herein relative to the computer 502.

Logical connections between the computer 502 and the remote computer device 548 are depicted as a local area network (LAN) 550 and a general wide area network (WAN) 552. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

When implemented in a LAN networking environment, the computer 502 is connected to a local network 550 via a network interface or adapter 554. When implemented in a WAN networking environment, the computer 502 typically includes a modem 556 or other means for establishing communications over the wide network 552. The modem 556, which can be internal or external to the computer 502, can be connected to the system bus 508 via the input/output interfaces 540 or other appropriate mechanisms. It is to be appreciated that the illustrated network connections are exemplary and that other means of establishing communication link(s) between the computers 502 and 548 can be employed.

In a networked environment, such as that illustrated with the computer environment 500, program modules depicted relative to the computer 502, or portions thereof, may be stored in a remote memory storage device. By way of example, remote application programs 558 reside on a memory device of the remote computer 548. For purposes of illustration, application programs and other executable program components such as the operating system are illustrated herein as discrete Web blocks, although it is recognized that such programs and components reside at various times in different storage components of the computer 502, and are executed by the data processor(s) of the computer 502. It will be appreciated that the network connections shown and described are exemplary and other means of establishing a communications link between the computers may be used.

Various modules and techniques may be described herein in the general context of the computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, control objects 650, components, control node data structures 654, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

An implementation of these modules and techniques may be stored on or transmitted across some form of the computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise “computer storage media” and “communications media.”

“Computer storage media” includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer readable instructions, control node data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

“Communication media” typically embodies computer readable instructions, control node data structures, program modules, or other data in a modulated data signal, such as carrier wave or other transport mechanism. Communication media also includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

Although systems, media, methods, approaches, processes, etc. have been described in language specific to structural and functional features and/or methods, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary forms of implementing the claimed invention.

1. A method comprising: under control of one or more processors configured with executable instructions: receiving a plurality of Universal Resource Locators (URLs) of one or more Web sites that correspond to a plurality of Web pages; analyzing the plurality of URLs to determine the content of the plurality of URLs and structure information of the one or more Web sites; and structuring a plurality of web blocks into a concept network based on information contained within the analyzed plurality of URLs and structure information of the one or more Web sites.
2. The method of claim 1, further comprising displaying the concept network.
3. The method of claim 2, further comprising receiving user input in response to the displayed concept network.
4. The method of claim 1, further comprising determining whether a particular link is an upward link, a downward link, a sibling link, or a crosswise link based on a URL.
5. A method comprising: under control of one or more processors configured with executable instructions: accessing a plurality of domain-specific Web sites; mining information based on structure information and relative content of the plurality of domain-specific Web sites, the structure information comprising information based at least in part on Universal Resource Locators (URLs) associated with the plurality of domain-specific Web sites, the information including link types of the URLs, the link types including one or more of an offspring link, an ancestor link, or a sibling link; automatically constructing a domain specific taxonomy based on the mined information including the link types; and formulating a concept network based on the domain specific taxonomy.
6. The method of claim 5, wherein the information mining is based on link structure and content.
7. The method of claim 5, wherein the concept network is formulated based on entropy.
8. The method of claim 5, wherein the concept network is formulated based on mutual information.
9. The method of claim 5, wherein the concept network is formulated based on similarity.
10. A method comprising: under control of one or more processors configured with executable instructions: generating a concept network, the generation including: analyzing structure information about a first Web site and a second Web site, the structure information comprising information based at least in part on Universal Resource Locators (URLs) associated with the first Web site and the second Web site; structuring URLs in the first Web site into a first set of concept network nodes and URLs in the second Web site into a second set of concept network nodes based on a result of the analyzed structure information; and merging the first set of concept network nodes with the second set of concept network nodes based on a measure of mutual information between at least one of the first set of concept network nodes and at least one of the second set of concept network nodes.
11. The method of claim 10, wherein the structural information comprises information based at least in part on hidden concepts within each Web page.
12. A method comprising: under control of one or more processors configured with executable instructions: generating a concept network, the generation including analyzing structure information about a plurality of data storage locations based on a query submitted from a user, the structure information comprising information based at least in part on Universal Resource Locators (URLs) associated with the plurality of data storage locations; grouping a first set of data storage locations into a first set of concept network nodes and a second set of data storage locations into a second set of concept network nodes; and merging the first set of concept network nodes with the second set of concept network nodes based on a measure of mutual information between at least one of the first set of concept network nodes and at least one of the second set of concept network nodes.
13. The method of claim 12, wherein the structural information comprises information based at least in part on hidden concepts within each data storage locations.
14. The method of claim 12, wherein the data storage location comprises a Web page.
15. A computer readable medium having computer executable instructions for generating a concept network, comprising: analyzing structure information about a first Web site and a second Web site based on a query submitted from a user, the structure information comprising information based at least in part on Universal Resource Locators (URLs) associated with the first Web site and the second Web site; structuring URLs in the first Web site into a first set of concept network nodes and URLs in the second Web site into a second set of concept network nodes based on a result of the analyzed structure information; and merging the first set of concept network nodes with the second set of concept network nodes based on a measure of mutual information between at least one of the first set of concept network nodes and at least one of the second set of concept network nodes.
16. A method comprising: under control of one or more processors configured with executable instructions: automatically deriving a domain specific taxonomy, the deriving including: analyzing structure information about a plurality of data storage locations based on a query submitted from a user, the structure information comprising information based at least in part on Universal Resource Locators (URLs) associated with the plurality of data storage locations; grouping a first set of data storage locations into a first set of hierarchical concept network nodes and a second set of data storage locations into a second set of hierarchical concept network nodes; and merging the first set of hierarchical concept network nodes with the second set of hierarchical concept network nodes based on a measure of similarity between at least one of the first set of hierarchical concept network nodes and at least one of the second set of hierarchical concept network nodes.
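
By way of illustration only, and not limitation, the link-type determination recited in claim 4 (classifying a link as upward, downward, sibling, or crosswise based on a URL) might be sketched as follows. The claims above do not define the four link types, so the directory-based definitions used here are assumptions: a link is treated as upward when the target lies in an ancestor directory of the source, downward when it lies in a descendant directory, sibling when both pages share a directory, and crosswise otherwise (including links to a different host). The names `classify_link` and `_directory` and the `example.com` URLs are hypothetical.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlsplit


def _directory(url: str) -> Tuple[str, Tuple[str, ...]]:
    """Return (host, directory segments) for a URL, ignoring any file name."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # A trailing slash denotes a directory; otherwise drop the last segment.
    directory = path if path.endswith("/") else posixpath.dirname(path)
    segments = tuple(s for s in directory.split("/") if s)
    return parts.netloc.lower(), segments


def classify_link(source_url: str, target_url: str) -> str:
    """Classify a hyperlink using only the two URLs (assumed definitions)."""
    src_host, src_dir = _directory(source_url)
    dst_host, dst_dir = _directory(target_url)

    if src_host != dst_host:
        return "crosswise"  # assumption: links between hosts are crosswise
    if src_dir == dst_dir:
        return "sibling"    # both pages sit in the same directory
    if len(dst_dir) < len(src_dir) and src_dir[:len(dst_dir)] == dst_dir:
        return "upward"     # target is in an ancestor directory
    if len(dst_dir) > len(src_dir) and dst_dir[:len(src_dir)] == src_dir:
        return "downward"   # target is in a descendant directory
    return "crosswise"      # unrelated branches of the same site


if __name__ == "__main__":
    page = "http://example.com/cameras/digital/index.html"
    for target in (
        "http://example.com/cameras/index.html",
        "http://example.com/cameras/digital/slr/models.html",
        "http://example.com/cameras/digital/compact.html",
        "http://example.com/printers/index.html",
    ):
        print(classify_link(page, target), target)
```

A classification of this kind uses only the URL strings, so it can be computed without fetching page content; other embodiments may define the link types differently.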